
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)#315

Merged
cocohearts merged 1 commit into openai:main from jfprincz:submission/11l-partialrope-lateqat-1.1248 on Mar 23, 2026

Conversation

@jfprincz (Contributor) commented Mar 21, 2026

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)

val_bpb: 1.1248 (sliding window, stride=64) | 15.6 MB | 8xH100 SXM, 600s

Progress from prior submissions

| | PR #70 | PR #164 | PR #198 | PR #287 | This PR | Delta vs #287 |
|---|---|---|---|---|---|---|
| val_bpb (sliding) | 1.1659 (s256) | 1.1524 (s256) | 1.1318 (s64) | 1.1271 (s64) | 1.1248 (s64) | -0.0023 |
| Layers | 9 | 9 | 11 | 11 | 11 | |
| Params | 21.8M | 22.4M | 26.8M | 26.8M | 26.8M | |
| Artifact | 14.9 MB | 15.4 MB | 15.7 MB | 15.5 MB | 15.6 MB | +0.1 MB |

Two new techniques on top of PR #287's 11-layer stack.

Key additions over PR #287

| Change | Impact |
|---|---|
| Partial RoPE (16 of 64 dims) | Apply rotary embeddings to only 25% of head dimensions; the remaining dims use position-free attention, improving generalization. Zero new parameters. |
| LN Scale | RMSNorm outputs scaled by 1/sqrt(layer_idx+1), damping deeper layers' contributions and stabilizing training. Zero new parameters. |
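Both additions can be sketched in plain Python (a minimal illustration, not the submitted `train_gpt.py`; the adjacent-pair rotation convention and function names are assumptions):

```python
import math

def partial_rope(q, rope_dims=16, pos=0, base=10000.0):
    """Apply rotary embedding to only the first `rope_dims` entries of a
    64-dim head vector; the remaining dims pass through position-free."""
    out = list(q)
    for i in range(rope_dims // 2):
        theta = pos / (base ** (2 * i / rope_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[2 * i], q[2 * i + 1]
        out[2 * i] = x * c - y * s       # rotate each (x, y) pair by theta
        out[2 * i + 1] = x * s + y * c
    return out

def ln_scale(normed, layer_idx):
    """Scale an RMSNorm output by 1/sqrt(layer_idx + 1), damping deeper layers."""
    s = 1.0 / math.sqrt(layer_idx + 1)
    return [v * s for v in normed]
```

At position 0 the rotation is the identity, and dims 16..63 are untouched at any position, which is what makes the change parameter-free.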

Everything else from PR #287 carries forward: 11 layers, XSA on last 4 layers, EMA (0.997), OrthoInit + muP, 3x MLP, int6 mixed quant + zstd-22, WD=0.04, SmearGate, BigramHash(2048), FA3, seq 2048, tuned Muon.
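The carried-forward EMA is a simple lerp of shadow weights toward the live weights; a pure-Python stand-in for the fused `torch._foreach_lerp_` call (update cadence and list layout are assumptions):

```python
def ema_update(shadow, params, decay=0.997):
    """In place: shadow <- decay * shadow + (1 - decay) * params.
    Equivalent to torch._foreach_lerp_(shadow, params, 1 - decay)."""
    for i, p in enumerate(params):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * p
    return shadow
```

At decay 0.997 the shadow effectively averages over roughly 1/(1-0.997) ≈ 333 recent updates; evaluation and quantization would then presumably run on the shadow weights rather than the live ones.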

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1418 |
| Int6 roundtrip val_bpb | 1.1485 |
| Int6 sliding val_bpb (s64) | 1.1248 |
| Steps completed (600s cap) | 7,051 |
| Step time | 85 ms |
| Model params | 26,829,913 |
| Artifact size | 15,612,308 bytes |
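The sliding-window evaluation (stride 64) amounts to scoring each token once with up to a full context window of left context; a hypothetical reconstruction of the span bookkeeping, not the repo's eval code (`window` default and scoring convention are assumptions):

```python
def sliding_eval_positions(n_tokens, window=2048, stride=64):
    """Yield (context_start, score_start, score_end) spans so that every
    token is scored exactly once, each with up to `window` tokens of
    left context; smaller stride = more context per scored token."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)  # pad context leftward to the window
        yield ctx_start, pos, end          # score only [pos, end)
        pos = end
```

This is why stride 64 (s64) yields a lower bpb than stride 256 (s256) in the progress table: the same model sees more context per scored token, at the cost of more forward passes.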

Reproducibility (3 seeds)

| Seed | Steps | Sliding s64 | Artifact (bytes) |
|---|---|---|---|
| 2025 | 7,051 | 1.1248 | 15,612,308 |
| 42 | 7,061 | 1.1250 | 15,528,666 |
| 1337 | 7,063 | 1.1253 | 15,639,340 |

Mean: 1.1250 | Range: 0.0005 | Submitted: seed 2025

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 QAT_THRESHOLD=0.1 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Note on Late QAT

The submitted code includes a Late QAT flag (LATE_QAT=1) intended to enable STE int6 fake-quantization in the final 4% of training. Post-submission analysis (credit: @152334H) revealed that torch.compile constant-folds the CastedLinear._qat_enabled class attribute at first trace, so the STE branch is dead-code-eliminated and never activates during training. Late QAT had no effect on the results. The score is driven entirely by Partial RoPE and LN Scale.
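The failure mode can be reproduced in miniature without torch: a "compiler" that reads the flag once at trace time bakes the False branch into the compiled function, so flipping the class attribute later changes nothing (toy analogy; `trace` and `fake_quant` are illustrative stand-ins):

```python
class CastedLinear:
    _qat_enabled = False  # class attribute, read once at trace time

def fake_quant(x):
    # stand-in for STE int6 fake-quantization
    return round(x * 32) / 32

def trace(layer_cls):
    """Toy compiler: specializes on the flag's value at first trace,
    mimicking torch.compile constant-folding the class attribute."""
    if layer_cls._qat_enabled:           # folded to a constant here
        return lambda x: fake_quant(x)
    return lambda x: x                   # QAT branch dead-code-eliminated

compiled_forward = trace(CastedLinear)   # traced while the flag is False
CastedLinear._qat_enabled = True         # late-QAT flip arrives too late
```

After the flip, `compiled_forward` still returns its input unquantized, exactly as the post-submission analysis found for the real training run.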

@himanalot

yes! great job this is sort of where i went too

bopmite added a commit to bopmite/parameter-golf that referenced this pull request Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
robinojw pushed a commit to robinojw/parameter-golf that referenced this pull request Mar 21, 2026
- Add FA3 > FA2 > SDPA attention backend dispatch
- FA2 wrapper uses @torch.compiler.disable + fullgraph=False
- FA3 uses fullgraph=True (compatible with torch.compile)
- Default FP16_KEEP_NAME_PATTERNS empty (quantize everything, matches PR openai#315)
- Add pod_setup.sh with FA3/FA2 install flow
- Add build_fa3_wheel.sh for pre-building FA3 on cheap 1xH100
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
Rename folder to today's date. Replace train_gpt.py with the new
baseline from PR openai#315 (11L XSA4 + EMA + Partial RoPE + Late QAT,
1.1248 BPB). Previous script preserved as previous_train_gpt.py.
Update README with PR lineage and new baseline context.
filipviz added a commit to filipviz/parameter-golf that referenced this pull request Mar 21, 2026
…unner

Port per-head gated attention (12ch, 2*sigmoid) into the PR openai#315
train_gpt.py (11L XSA4 + EMA + Partial RoPE + Late QAT, 1.1248 BPB).
Update run script to use PR openai#315 config for both baseline and experiment.
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
- Rebased train_gpt.py on PR openai#315 (1.1248 BPB SOTA)
- Added SGD TTT and causal TTT variant
- Added gradient-guided adaptive quantization (int5/int6/int7)
- Added z-loss regularization
- Updated plan with current landscape and run commands
@jfprincz jfprincz force-pushed the submission/11l-partialrope-lateqat-1.1248 branch from dfb05a5 to 2951651 on March 21, 2026 21:01
@jfprincz jfprincz changed the title from "Record: 11L Partial RoPE + LN Scale + EMA + Late QAT + XSA4 (val_bpb: 1.1248)" to "Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)" on Mar 21, 2026
saml212 added a commit to saml212/parameter-golf that referenced this pull request Mar 21, 2026
Merged records from all experiment branches into one working branch.
Updated CLAUDE.md with current competitive landscape and next priorities.
Rewrote idea bank with tiered roadmap for closing the gap to openai#315.
felipe-parodi added a commit to felipe-parodi/parameter-golf that referenced this pull request Mar 21, 2026
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 21, 2026
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 21, 2026
torch.compile constant-folds CastedLinear._qat at first trace.
Credit: @152334H via PR openai#315.
charmquark1984 added a commit to charmquark1984/parameter-golf that referenced this pull request Mar 21, 2026
13 techniques tested that did NOT work on PR openai#315 base:
- Causal TTT (3 variants): neutral on EMA+XSA base
- MTP: +0.028 BPB, throughput penalty kills it
- INT4: 0.06 BPB quant gap wipes out param advantage
- Canon layers: 48% step overhead not compensated
- Memory tokens, gradient-guided quant, cautious WD,
  L1 regularization, label smoothing, 1M batch, full QAT

4 positive findings:
- EMA > SWA by 0.003 BPB (3-seed verified)
- Weight decay directly controls artifact size
- 786K > 524K batch by 0.004 BPB
- FA3 Hopper: 15-20% more steps at same wallclock

Best verified result: 1.1257 BPB (PR openai#315 reproduction)
Includes 12 training logs for verification.
turazashvili added a commit to turazashvili/parameter-golf that referenced this pull request Mar 22, 2026
Safe config matching PR openai#315 proven techniques:
- 11 layers, MLP 3x (1536), BigramHash 2048
- Muon backend_steps=5, momentum=0.99 (proven by all top PRs)
- XSA on last 4 layers, Partial RoPE 16/64, LN Scale, Late QAT
- EMA decay=0.997 every 4 steps via torch._foreach_lerp_
- CUDA_DEVICE_MAX_CONNECTIONS=1 for multi-GPU overlap
- SmearGate, OrthoInit, int5 MLP/int6 attention, zstd-22
EthanYangTW added a commit to EthanYangTW/parameter-golf that referenced this pull request Mar 22, 2026
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
mrdavtan added a commit to mrdavtan/parameter-golf that referenced this pull request Mar 23, 2026
… 3 seeds)

AdamW TTT with cosine lr decay over 30 epochs and per-layer lr groups
(3x for MLP output projections, 0.5x for input projections). 34 TTT
configurations tested. FINDINGS.md documents 31 experiments including
negative results on codebook quantization, symmetry-transport, layer
dropping, focal loss, and KL divergence TTT.

Builds on PRs openai#162, openai#180, openai#77, openai#398, openai#442, openai#417, openai#315.
alia-abbas added a commit to alia-abbas/parameter-golf that referenced this pull request Mar 23, 2026
@cocohearts cocohearts merged commit cdabe13 into openai:main Mar 23, 2026
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 23, 2026
ROPE_DIMS=16: apply rotary to 25% of head dims, rest position-free
LN_SCALE=1: scale RMSNorm output by 1/sqrt(layer+1)
Both env-var gated, default off — existing runs unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nvemuri4649 pushed a commit to thanushpatlolla/parameter-golf that referenced this pull request Mar 27, 2026
…e-lateqat-1.1248

Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
